Tag
28 articles
Microsoft's Fara1.5 is a new family of browser computer-use agents that can navigate and interact with web interfaces to perform complex tasks. This advancement showcases the growing capabilities of multimodal AI systems in real-world, interactive environments.
Learn what a universal AI interface is and how, by understanding multiple types of information at once, it could change the way we interact with technology.
ByteDance's Intelligent Creation Lab has released Lance, an open-source unified multimodal model capable of image and video understanding, generation, and editing in a single framework using just 3 billion parameters.
This explainer explores the advanced AI technologies behind YouTube Shorts Remix, including multimodal modeling, video understanding, and generative synthesis techniques.
Google has launched Gemini Omni Flash, a multimodal video-generation model with avatar mode and default SynthID watermarking. Speech-editing features are being held back for further development.
This article explains the advanced technical concepts behind Google's Gemini AI, including its multimodal architecture, attention mechanisms, and implications for AI development and deployment.
Thinking Machines Lab introduces TML-Interaction-Small, a 276B-parameter model that enables real-time multimodal AI interaction through continuous full-duplex exchange.
Nvidia releases Nemotron 3 Nano Omni, an open multimodal AI model supporting text, image, video, and audio. The model's training data comes from sources such as Qwen, GPT-OSS, Kimi, and DeepSeek OCR.
Nvidia's Nemotron 3 Nano Omni is an open-weight multimodal AI model designed to power autonomous agents on edge devices, marking a strategic shift for the company beyond hardware sales.
Encoders are the unsung heroes of AI, translating real-world data into machine-understandable formats. As they evolve, they're enabling more advanced, multimodal AI systems that mirror human cognition.
OpenAI releases GPT-5.5, bringing the company one step closer to its vision of an AI 'super app' with enhanced multimodal capabilities.
This article explains how OpenAI's ChatGPT Images 2.0 enhances image generation through reasoning capabilities and web search integration, enabling more consistent and contextually aware visual output.